release-19.1: batcheval: estimate stats in EvalAddSSTable #36525

dt · 2019-04-04T12:53:06Z

Backport 1/1 commits from #35231.

/cc @cockroachdb/release

Adding the SSTs stats directly, without removing the underlying stats and computing merged stats,
avoids some potentially expensive repeated reads/reccomputation of the underlying existing data when
repeatedly adding sparse, wide SSTs to existing, full-ish ranges.

When adding one big SST to an empty range, the previous approach was not particularly expensive:
recomputing the existing stats was cheap, as the underying range was empty-ish, and computing
the merged stats was able to mostly iterate sequentially over just the SST.

However when adding an SST with keys distributed over an existing, non-empty range, the
recomputation of the existing span stats actually has to read and process a large amount
of existing data, and the merged recompute has to iterate over it as well -- and in doing
so has to jump between the go-side merge iterator and the underlying Rocks-backed range.

Instead, we can optimistically assume all the keys/data in the SST being added is in fact
being added -- i.e. not shadowing existing keys -- and simply add its stats to the range
stats without recomputing. This will be incorrect in some cases -- when a key does shadow
or, in particular, when retrying and all keys are shadowed -- but by flipping the flag
in the stats to indicate that the stats contain estimates, we can document our assumption's
imperfection. When the descrepency is found and fixed by a recompute, the flag will prevent
a consistency checker error being raised.

These estimated stats are probably good enough 'as-is', but a followup could send an explicit
CheckConsistency command to all ranges that were sent SSTs during a given bulk operation to
expedite the correction of any inccorrect estimates.

Release note (performance improvement): Speed up bulk data ingestion during index backfills and IMPORT.

Adding the SSTs stats directly, without removing the underlying stats and computing merged stats, avoids some potentially expensive repeated reads/reccomputation of the underlying existing data when repeatedly adding sparse, wide SSTs to existing, full-ish ranges. When adding one big SST to an empty range, the previous approach was not particularly expensive: recomputing the existing stats was cheap, as the underying range was empty-ish, and computing the merged stats was able to mostly iterate sequentially over just the SST. However when adding an SST with keys distributed over an existing, non-empty range, the recomputation of the existing span stats actually has to read and process a large amount of existing data, and the merged recompute has to iterate over it as well -- and in doing so has to jump between the go-side merge iterator and the underlying Rocks-backed range. Instead, we can optimistically assume all the keys/data in the SST being added is in fact being added -- i.e. not shadowing existing keys -- and simply add its stats to the range stats without recomputing. This will be incorrect in some cases -- when a key does shadow or, in particular, when retrying and all keys are shadowed -- but by flipping the flag in the stats to indicate that the stats contain estimates, we can document our assumption's imperfection. When the descrepency is found and fixed by a recompute, the flag will prevent a consistency checker error being raised. These estimated stats are probably good enough 'as-is', but a followup could send an explicit CheckConsistency command to all ranges that were sent SSTs during a given bulk operation to expedite the correction of any inccorrect estimates. Release note (performance improvement): Speed up bulk data ingestion during index backfills and IMPORT.

cockroach-teamcity · 2019-04-04T12:53:16Z

This change is

tbg · 2019-04-04T12:56:34Z

LGTM

…

On Thu, Apr 4, 2019 at 2:53 PM cockroach-teamcity ***@***.***> wrote: This change is [image: Reviewable] <https://reviewable.io/reviews/cockroachdb/cockroach/36525> — You are receiving this because you are on a team that was mentioned. Reply to this email directly, view it on GitHub <#36525 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AE135LkYx0mzhiji8wxyGfnhcwoJySuDks5vdfXHgaJpZM4ccuYn> .

dt requested review from bdarnell, tbg, nvanbenschoten and a team April 4, 2019 12:53

dt merged commit 5a76863 into cockroachdb:release-19.1 Apr 4, 2019

dt deleted the backport19.1-35231 branch April 4, 2019 17:18

tbg mentioned this pull request Apr 18, 2019

storage: consistency check failure during import #36861

Closed

3 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

release-19.1: batcheval: estimate stats in EvalAddSSTable #36525

release-19.1: batcheval: estimate stats in EvalAddSSTable #36525

dt commented Apr 4, 2019

cockroach-teamcity commented Apr 4, 2019

tbg commented Apr 4, 2019 via email

release-19.1: batcheval: estimate stats in EvalAddSSTable #36525

release-19.1: batcheval: estimate stats in EvalAddSSTable #36525

Conversation

dt commented Apr 4, 2019

cockroach-teamcity commented Apr 4, 2019

tbg commented Apr 4, 2019 via email